What should be expected from feature selection in small-sample settings

Authors

  • Chao Sima
  • Edward R. Dougherty
Abstract

Motivation: High-throughput technologies for rapid measurement of vast numbers of biological variables offer the potential for highly discriminatory diagnosis and prognosis; however, high dimensionality together with small samples creates the need for feature selection, while at the same time making feature-selection algorithms less reliable. Feature selection must typically be carried out from among thousands of gene-expression features and in the context of a small sample (small number of microarrays). Two basic questions arise: (1) Can one expect feature selection to yield a feature set whose error is close to that of an optimal feature set? (2) If a good feature set is not found, should it be expected that good feature sets do not exist?

Results: The two questions translate quantitatively into questions concerning conditional expectation. (1) Given the error of an optimal feature set, what is the conditionally expected error of the selected feature set? (2) Given the error of the selected feature set, what is the conditionally expected error of the optimal feature set? We address these questions using three classification rules (linear discriminant analysis, linear support vector machine and k-nearest-neighbor classification) and feature selection via sequential floating forward search and the t-test. We consider three feature-label models and patient data from a study concerning survival prognosis for breast cancer. With regard to the two focus questions, there is similarity across all experiments: (1) one cannot expect to find a feature set whose error is close to optimal, and (2) the inability to find a good feature set should not lead to the conclusion that good feature sets do not exist. In practice, the latter conclusion may be more immediately relevant: when faced with the common occurrence that a feature set discovered from the data does not give satisfactory results, the experimenter can draw no conclusions regarding the existence or nonexistence of suitable feature sets.

Availability: http://ee.tamu.edu/~edward/feature_regression/
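The t-test filter used in the abstract ranks each feature by a two-sample t statistic computed on the small training sample and keeps the top-scoring features. A minimal stdlib-only sketch of that filtering step (function names and toy data are illustrative assumptions, not taken from the paper's code):

```python
import math

def t_statistic(xs, ys):
    # Welch two-sample t statistic for a single feature:
    # (mean difference) / sqrt(var_x/n_x + var_y/n_y)
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

def select_top_k(class0, class1, k):
    # Rank features by |t| and return the indices of the top k.
    # class0/class1: lists of samples, each a list of feature values.
    n_features = len(class0[0])
    scores = []
    for j in range(n_features):
        xs = [row[j] for row in class0]
        ys = [row[j] for row in class1]
        scores.append((abs(t_statistic(xs, ys)), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

# Hypothetical toy data: feature 1 separates the classes, feature 0 is noise.
class0 = [[0.1, 5.0], [0.2, 5.1], [0.0, 4.9]]
class1 = [[0.15, 1.0], [0.05, 1.1], [0.1, 0.9]]
print(select_top_k(class0, class1, 1))  # -> [1]
```

The paper's point is precisely that, with samples this small, the error of the feature set chosen by such a filter can be far from the error of the truly optimal feature set.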


Similar articles

Feature Selection for Small Sample Sets with High Dimensional Data Using Heuristic Hybrid Approach

Feature selection can be decisive when analyzing high-dimensional data, especially with a small number of samples. Feature-extraction methods do not perform well under these conditions. With small sample sets and high-dimensional data, exploring a large search space and learning from insufficient samples becomes extremely hard. As a result, neural networks and clustering a...


Determining sample size in medical research

One of the questions that most investigators, especially physicians, encounter when planning a research project is determining the sample size. Selecting a sample larger than what is required to obtain the desired results wastes resources, while selecting very small samples often leads the investigator to results that have no practical use. Determining the number of the required sa...


Online Streaming Feature Selection Using Geometric Series of the Adjacency Matrix of Features

Feature Selection (FS) is an important pre-processing step in machine learning and data mining. All the traditional feature selection methods assume that the entire feature space is available from the beginning. However, online streaming features (OSF) are an integral part of many real-world applications. In OSF, the number of training examples is fixed while the number of features grows with t...


The peaking phenomenon in the presence of feature-selection

For a fixed sample size, a common phenomenon is that the error of a designed classifier decreases and then increases as the number of features grows. This peaking phenomenon has been recognized for forty years and depends on the classification rule and feature-label distribution. Historically, the peaking phenomenon has been treated by assuming a fixed ordering of the features, usually beginnin...


Adaptive Sequential Feature Selection for Pattern Classification

Feature selection helps to focus resources on relevant dimensions of input data. Usually, reducing the input dimensionality to the most informative features also simplifies subsequent tasks, such as classification. This is, for instance, important for systems operating in online mode under time constraints. However, when the training data is of limited size, it becomes difficult to define a sin...



Journal:
  • Bioinformatics

Volume 22, Issue 19

Publication year: 2006